Background

In the second lecture, we learned that JSDMs relax the assumption of species’ independence. This is one of the more critical assumptions made in VGLMMs, that is ecologically unrealistic. It is unrealistic, because we often expect species to co-occur, which results in positive correlation, or the opposite of co-occurring (avoidance) results in negative correlation between species.

These (residual) correlations are difficult to interpret. They can be caused by species interactions, but are confounded with all other sources of unmeasured variation. For example, if we forget to include an important covariate in the model, this will also result in residual correlations. That makes that the associations are useful for improving prediction, but perhaps not so much for inference?

Technically, JSDMs are models for binary data: presence-absence of species. However, we can generalize this to other data types and responses by keeping the model on the link-scale the same. In the gllvm R-package it is pretty straightforward: we just change the family argument to (for example) “Poisson”, “ordinal”, “beta” or “tweedie” (see ?gllvm). In this exercise, we will fit JSDMs to binary data.

Data

We will explore fitting JSDMs using a binary dataset of alpine plants by D’Amen et al. (2018) , used to demonstrate GLLVMs by van der Veen et al. (2021). The data is available in the github repository as “Alpine”, in the data folder. We can read it in as follows:

Y <- read.csv("../../data/alpineY.csv")[,-1]
X <- read.csv("../../data/alpineX.csv")[,-1]
X <- data.frame(lapply(X, function(x)if(is.numeric(x)){scale(x)}else{as.factor(x)}))

you might have to change the exact working directory to make it work for your exact set-up. A good place to start an analysis, is to first examine the data a bit more, so we know what we are dealing with:

dim(Y)
## [1] 912 175
colnames(X)
## [1] "X"      "Y"      "DDEG0"  "SLOPE"  "MIND"   "SOLRAD" "TPI"

The data are presence-absences, and there are 912 rows and 175 columns (species). Because we will be modelling species-specific responses, we should ensure that each species has enough observations in the data. There area large number of rows, and although the method can technically accommodate sites without any observations, we can speed things up a little by removing the ones that are empty.

min(colSums(Y))
## [1] 22
table(rowSums(ifelse(Y==0,0,1))>3)
## 
## FALSE  TRUE 
##   142   770

From the 175 species, all have at least 22 observations. This is because D’Amen et al. already filtered the species and removed the ones with fewer than 22 presences. Removing species with few presences is a personal decision: it can considerably speed up model fitting and improve convergence (parameters for species with little data are often difficult to estimate), but of course we lose vital information. From the perspective of ordination, as covered in the lecture, we might want to retain species with few observations when they add vital information about the range or limits of the ecological gradient. However, here we are fitting JSDMs, and we take a different angle: the parameters of species with few observations cannot be accurately estimated anyway, so we might as well get rid of them!

We also see that 72 rows in the data have no information, so we get rid of those.

X <- X[rowSums(Y)>0, ]
Y <- Y[rowSums(Y)>0,]

Fitting a Joint Species Distribution Model

The modeling goes as before, with the gllvm function. However, now we also use num.lv, which stands for the number of latent variables added to the model. Here, we fit JSDMs using the “factor-analytic” approach, or latent variable modeling. Few latent variables makes for a fast model, but potentially poor estimation of the correlation of species. So, we are left with trade-off: wait for a long time for an accurate estimate, or quickly get something slightly less accurate.

A good place to start, is two latent variables (the default). We can also add covariates, or random effects, to the model, but we will start without. We will use a few arguments to speed up the model, as well as fitting the model in parallel (TIP: you might want to open a task manager os ystem monitor to keep an eye on your computer’s resources). The argument Lambda.struc simplifies the approximation to the likelihood and should be used cautiously, the argument sd.errors turns off calculation of standard errors as that can take longer than the actual model fitting at times, optim.method selects the numerical optimisation algorithm, because “L-BFGS-B” is often much faster with parallelisation than the default “BFGS”. All together, this significantly reduces the time to fit the model from tens of minutes to about half a minute (with 7 CPU). If you work on a laptop, ensure that your battery settings are not set to “balanced”, as this too will slow model fitting.

library(gllvm)
TMB::openmp(parallel::detectCores()-1, DLL = "gllvm", autopar = TRUE)
## gllvm 
##     7 
## attr(,"autopar")
## [1] TRUE
model1  <- gllvm(y = Y, num.lv = 2, family = "binomial", Lambda.struc = "diagonal", sd.errors = FALSE, optim.method = "L-BFGS-B")

Turning off standard error calculation is often a good idea when in gllvm, when working with large datasets or complex models. When we have decided on our “final” model, we can post-hoc calculate the standard errors using the se.gllvm function (an example is shown on the help page of that function). For now, let’s visualize the estimated associations:

## gllvm 
##     7 
## attr(,"autopar")
## [1] TRUE
corrplot::corrplot(getResidualCor(model1), type = "lower", order = "AOE", diag = FALSE, tl.pos = "l", tl.cex = 0.2, addgrid.col = NA)

The corrplot package helps us to order the plot so we might identify patterns. As you might realize at this point, it is -very- difficult to make ecological sense of these associations (partly because there are so many, and now we are “only” working with 175 species).

There are two directions we can take this now: add species-specific effects (fixed or random), or species-common effects (fixed or random), or try to go through a procedure of selecting the optimal number of latent variables to represent the associations, with AIC, BIC, goodnessOfFit. ?goodnessOfFit helps you to calculate some metrics for the predictive performance of the model, such as Tjur’s \(R^2\):

goodnessOfFit(Y, object = model1, measure = "TjurR2")
## $TjurR2
## [1] 0.3509823

this is used in distribution modeling to quantify “discriminative power”; if observations (presence or absence) can perfectly be classified by the model, it will equal 1. With more latent variables, or by adding covariates, it will probably improve.

Spatial row effect

The data includes longitude and latitude columns, as it is specially explicit. The package offers the options for incorporating spatial random row effects to account for autocorrelation that may exist, but these are not particularly fast to implement (yet) and is a work in progress. I do not recommend doing it in this exercise, the following code serves as reference in case you need it for your analysis:

model2  <- gllvm(y = Y, num.lv = 2, family = "binomial", Lambda.struc = "diagonal", sd.errors = FALSE, optim.method = "L-BFGS-B", row.eff = ~corExp(1|site), studyDesign = data.frame(site = 1:nrow(Y)), dist = list(as.matrix(X[,c("X","Y")])))

Part 1

The goal is to get familiar with Joint Species Distribution Modeling, and the output that the gllvm R-package provides in that context.

Here is what I want you to do:

  1. Explore the model with latent variables displayed above; fit it in your own R session, extract the associations, and visualize them with the corrplot function
  2. Fit models with different number of latent variables, and perform comparison. Here, I want you to use goodnessOfFit to try and find the model with the highest discriminative power
  3. Combine the latent variable approach with what you learned in the previous exercise: add some of the covariates (“DDEG0”: growing degree days above zero, “SLOPE”, “MIND”: moisture index, “SOLRAD”: total solar radiation in a year, “TPI”: a topography index)
  4. Use coefplot, randomCoefPlot, summary to try and draw conclusions about the main drivers of species’ presence for this dataset

If there is enough time, we will discuss as a group.

Part 2

Please be advised that the models 4 and 5 in this part are considerably more complex and take time to fit.

The above JSDM does not incorporate traits, or phylogeny. Both of these can help to improve our prediction of species’ distribution. The alpine plant dataset does not include either, so we will use a different dataset that does, for this second part. It is included in the package already, which makes loading it a bit easier.

data(fungi, package = "gllvm")
Y2 <- fungi$Y
X2 <- fungi$X
X2 <- data.frame(lapply(X2, function(x)if(is.numeric(x)){scale(x)}else{as.factor(x)}))
tree <- fungi$tree
covMat<- ape::vcv(tree)
distMat <- ape::vcv(tree)
TR <- fungi$TR
colnames(TR)[8] <- "Sp.log.vol" # funky column name needs to be changed

We again explore the data a little:

any(rowSums(Y2)==0)
## [1] FALSE
dim(Y2)
## [1] 1666  215
colnames(X2)
## [1] "DBH.CM"    "AVERDP"    "CONNECT10" "TEMPR"     "PRECIP"    "log.AREA" 
## [7] "REGION"    "RESERVE"
colnames(TR)
## [1] "PC1"        "PC2"        "PC3"        "PC1.1"      "PC1.2"     
## [6] "PC1.3"      "FB.type"    "Sp.log.vol" "Lifestyle"

This a binary dataset by Abrego et al. (2022), of 215 wood inhabiting fungi inhabiting 1666 logs. TR include the trait covariates, and X includes environmental variables related to the forest, or the deadwood, that the fungi occurred in/on. First, we will fit a model without phylogeny but with traits, and including “REGION” and “RESERVE” as random row effects.

model3 <- gllvm(Y2, X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
                  (DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3), 
                row.eff = ~(1|REGION/RESERVE), studyDesign =  X2[,c("REGION","RESERVE")],
                num.lv = 0, family = "binomial", sd.errors = FALSE,
                optim.method = "L-BFGS-B")

This model does not include species-specific random effects yet, so should fit pretty quickly. The included “traits” are Principal Components extracted from the original trait matrix, in order to a-priori reduce complexity of the model. Be careful when playing around: with these dataset dimensions the models start using loads of RAM, and when you run out of RAM R will crash, which can quickly happen when we add more traits. In general, when working with large datasets and complex models, using a server for more computing power is adviseable.

We can examine the fourth corner coefficients:

library(lattice)
fourth <- model3$fourth.corner # the coefficients
a <- max(abs(range(fourth)))
colort <- colorRampPalette(c("blue", "white", "red"))
plot.4th <- levelplot((as.matrix(fourth)), xlab = "Environmental Variables", 
                      ylab = "Species traits", col.regions = colort(100), cex.lab = 1.3, 
                      at = seq(-a, a, length = 100), scales = list(x = list(rot = 45)))
plot.4th

Excellent! The trait-environment interaction coefficients have been included, and visualized. You can also examine them using the model its summary. With this model we assumed that species’ responses to the environment are fully determined by species’ traits. Usually, we do not know if we have measured the right traits, so incorporating “residual” information in species’ responses to the environmental covariates is vital. We relax our assumption by fitting another model (this might take a little):

model4 <- gllvm(Y2, X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
                  (DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3), 
                randomX = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP,
                row.eff = ~(1|REGION/RESERVE), studyDesign =  X2[,c("REGION","RESERVE")],
                num.lv = 0, family = "binomial", sd.errors = FALSE,
                optim.method = "L-BFGS-B", Ab.struct = "diagonal", maxit=1e5)

this formulation ensures that the relevant correlations between species’ random effects are incorporated, which we can also examine with summary. Before we do that, we will add the final component to our model; the phylogeny. The gllvm package broadly implements phylogenetic mixed-effects models, which can also be used to fit models for traits or individuals (i.e., where traits are the response variables), but here we focus on our species data. Phylogenetic random effects make the model considerably more computer intensive (complex), and there are a range of considerations that we need to make when fitting it. In particular, the nearest neighbour approximation and the ordering of the species. There are various ways to find the “optimal” setting for both those things, which we will not go further into here (but ask if you want to know!). For further details see the corresponding vignette in the package. For the following models, you will need at least 30 minutes computing time. Mind yourself, this is incredibly fast; it took the original authors ten days with the Hmsc R-package. For the exercises, you can reduce this further by (e.g.,) subsetting the data and fitting the models to a smaller number of species and sites.

e <- eigen(covMat)$vectors[,1]
ord <- gllvm:::findOrder(covMat = covMat, distMat = distMat, nn = 15, order = order(e))$order
spec.ord <- colnames(covMat)[ord]
model5 <- gllvm(Y2[,spec.ord], X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
                  (DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3), 
                randomX = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP,
                row.eff = ~(1|REGION/RESERVE), studyDesign =  X2[,c("REGION","RESERVE")],
                num.lv = 0, family = "binomial", sd.errors = FALSE,
                optim.method = "L-BFGS-B", Ab.struct = "MNdiagonal", maxit = 1e5,
                colMat = list(covMat[spec.ord, spec.ord], dist = distMat[spec.ord, spec.ord]), colMat.rho.struct = "term", nn.colMat = 15)

Let’s also calculate the standard error for our “final” model:

ses <- se(model5)
model5$sd <- ses$sd
model5$Hess <- ses$Hess

We can examine the phylogenetic signal with summary and plot the species-specific random effects (without trait effects) jointly with the phylogeny. We first need to calculate standard errors to do that:

summary(model5)
## 
## Call:
## gllvm(y = Y2[, spec.ord], X = X2, TR = TR, formula = ~DBH.CM + 
##     AVERDP + CONNECT10 + TEMPR + PRECIP + (DBH.CM + AVERDP + 
##     CONNECT10 + TEMPR + PRECIP):(PC1.1 + PC1.2 + PC1.3), family = "binomial", 
##     num.lv = 0, studyDesign = X2[, c("REGION", "RESERVE")], colMat = list(covMat[spec.ord, 
##         spec.ord], dist = distMat[spec.ord, spec.ord]), colMat.rho.struct = "term", 
##     row.eff = ~(1 | REGION/RESERVE), sd.errors = FALSE, randomX = ~DBH.CM + 
##         AVERDP + CONNECT10 + TEMPR + PRECIP, optim.method = "L-BFGS-B", 
##     Ab.struct = "MNdiagonal", maxit = 1e+05, nn.colMat = 15)
## 
## Family:  binomial 
## 
## AIC:  104177.6 AICc:  104178 BIC:  106950.3 LL:  -51832 df:  257 
## 
## Informed LVs:  0 
## Constrained LVs:  0 
## Unconstrained LVs:  0 
## 
## Formula:  ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+DBH.CM:PC1.1+DBH.CM:PC1.2+DBH.CM:PC1.3+AVERDP:PC1.1+AVERDP:PC1.2+AVERDP:PC1.3+CONNECT10:PC1.1+CONNECT10:PC1.2+CONNECT10:PC1.3+TEMPR:PC1.1+TEMPR:PC1.2+TEMPR:PC1.3+PRECIP:PC1.1+PRECIP:PC1.2+PRECIP:PC1.3 
## LV formula:  ~ 0 
## Row effect:  ~(1 | REGION/RESERVE) 
## 
## Random effects:
##  Name      Signal Variance Std.Dev Corr                            
##  DBH.CM    0.0159 0.0021   0.0453                                  
##  AVERDP    1.0000 0.1522   0.3901   0.0021                         
##  CONNECT10 0.0000 0.0127   0.1127  -0.0274  0.0075                 
##  TEMPR     0.8186 0.0617   0.2484  -0.3826 -0.0298  0.8912         
##  PRECIP    0.7764 0.0318   0.1784   0.7568 -0.3552 -0.2760 -0.5933 
## 
## Coefficients predictors:
##                  Estimate Std. Error z value Pr(>|z|)    
## DBH.CM           0.148190   0.012541  11.817  < 2e-16 ***
## AVERDP          -0.324755   0.089667  -3.622 0.000293 ***
## CONNECT10       -0.031708   0.033889  -0.936 0.349449    
## TEMPR            0.187505   0.067485   2.778 0.005462 ** 
## PRECIP          -0.008682   0.059848  -0.145 0.884658    
## DBH.CM:PC1.1    -0.005848   0.006475  -0.903 0.366405    
## DBH.CM:PC1.2     0.009325   0.006187   1.507 0.131761    
## DBH.CM:PC1.3     0.008051   0.006488   1.241 0.214674    
## AVERDP:PC1.1    -0.044217   0.021957  -2.014 0.044025 *  
## AVERDP:PC1.2     0.021316   0.019199   1.110 0.266873    
## AVERDP:PC1.3    -0.055204   0.017764  -3.108 0.001886 ** 
## CONNECT10:PC1.1 -0.005468   0.012011  -0.455 0.648899    
## CONNECT10:PC1.2 -0.002231   0.011754  -0.190 0.849453    
## CONNECT10:PC1.3  0.010513   0.012002   0.876 0.381059    
## TEMPR:PC1.1      0.036291   0.017902   2.027 0.042639 *  
## TEMPR:PC1.2      0.014528   0.015413   0.943 0.345901    
## TEMPR:PC1.3      0.005430   0.016209   0.335 0.737653    
## PRECIP:PC1.1     0.008589   0.014090   0.610 0.542145    
## PRECIP:PC1.2     0.004515   0.012774   0.353 0.723760    
## PRECIP:PC1.3     0.030275   0.012997   2.329 0.019837 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
phyloplot(model5, tree = tree)

and plot the associations due to the environmental variables:

# There is a known bug in the package v2.0.0-2.0.1, this next line addresses that
model5$col.eff$colMat <- cov2cor(covMat)
corrplot::corrplot(getEnvironCor(model5), type = "lower", order = "AOE", diag = FALSE, tl.pos = "l", tl.cex = 0.2, addgrid.col = NA)